Stochastic forecasting of variable small data as a basis for analyzing an early stage of a cyber epidemic

Security Information and Event Management (SIEM) technologies play an important role in the architecture of modern cyber protection tools. One of the main scenarios for the use of SIEM is the detection of attacks on protected information infrastructure. Considering that the ISO 27001, NIST SP 800-61, and NIST SP 800-83 standards objectively do not keep up with the evolution of cyber threats, research aimed at forecasting the development of cyber epidemics is relevant. The article proposes a stochastic concept of describing variable small data on the Shannon entropy basis. The core of the concept is the description of small data by linear differential equations with stochastic characteristic parameters. The practical value of the proposed concept is embodied in the method of forecasting the development of a cyber epidemic at an early stage (in conditions of a lack of empirical information). In the context of the research object, the stochastic characteristic parameters of the model are the generation rate, the death rate, and the independent coefficient of variability of the measurement of the initial parameter of the research object. Analytical expressions for estimating the probability distribution densities of these characteristic parameters are proposed. It is assumed that these stochastic parameters of the model are defined on intervals, which allows for manipulation of the nature and type of the corresponding probability distribution density functions. The task of finding optimal probability distribution density functions of the characteristic parameters of the model with maximum entropy is formulated. The proposed method allows for generating sets of trajectories of values of characteristic parameters with optimal probability distribution density functions.
The example demonstrates both the flexibility and reliability of the proposed concept and method in comparison with the numerical series forecasting concepts implemented in standard Matlab functions.

The era of computer viruses has lasted a little more than 40 years [1-5]. One of the first viruses was developed for an Apple computer. It appeared in 1981, and the name of the "progenitor" was Elk Cloner. This virus was not so much harmful as annoying: at start-up, the user of the infected computer saw a funny (in the opinion of the cyberbully) poem on the screen, after which the computer worked in normal mode. The first widespread virus for computers running the MS-DOS operating system appeared in 1986 and was called Brain. However, the developers of this virus, the Pakistani brothers Farooq Alvi, did not want to harm people: they wrote Brain to protect the medical program they had created from unlicensed copying. Computer viruses have come a long way since their inception, and today's malicious programs are much more subtle than their counterparts from the 80s and 90s and are much more difficult to detect. In this regard, computer viruses are very similar to their biological "brothers". Today, users may not notice for years that a program is running on their gadget that silently collects information, forces the user's device to perform certain actions, or masks the actions of other, much more dangerous programs. Each type of malware has its own name and serves attackers in achieving various selfish goals [6-10].
One of the earliest computer virus epidemics happened as far back as 1988, when the "big worm", or the Morris worm, named after its author, Robert Morris, spread over the Arpanet network in the United States. The worm, picking up passwords, filled the computers of network users with its copies and thus managed to infect more than 6,000 computers, causing about 100 million dollars in damages, a colossal amount for those times. Since [...]
Any time series of morbidity can be considered as a random process consisting of a signal reflecting the real epidemic situation and high-frequency noise. Noise filtering allows us to refine the prediction and can be performed both during the pre-processing of the raw data and directly in the body of the prediction algorithm. One such approach is wavelet decomposition [35,36], in which a short time series is represented by wavelet functions. This approach is usually used in conjunction with other models. One such model is exponential smoothing, which is a special case of the weighted moving average: the incidence value $y(t)$ at time $t$ is described by the weighted sum of the last observations, $b\,y(t) + (1 - b)\,y(t - 1)$, where $b \in (0, 1)$ is a smoothing factor that reduces the weight of the data as they age, which can be considered a reflection of the natural learning process. This method of model creation is suitable for series whose behaviour shows a clear trend or seasonality. For cyber epidemics, these conditions are fulfilled only in the abstract.
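As an illustration of the exponential smoothing mentioned above, the following minimal sketch applies the commonly used recursive form, in which the second term is the previous smoothed value rather than the previous raw observation. The function name, the seed value `s[0] = y[0]`, and the sample series are assumptions made for the example only.

```python
def exponential_smoothing(series, b):
    """Return smoothed values s[t] = b * y[t] + (1 - b) * s[t - 1].

    The first smoothed value is seeded with the first observation,
    which is one common convention (an assumption of this sketch).
    """
    if not 0.0 < b < 1.0:
        raise ValueError("smoothing factor b must lie in (0, 1)")
    smoothed = [series[0]]
    for y in series[1:]:
        # Older observations are discounted geometrically by (1 - b).
        smoothed.append(b * y + (1.0 - b) * smoothed[-1])
    return smoothed


# Hypothetical incidence values, used only to exercise the function.
incidence = [2.0, 4.0, 3.0, 7.0, 6.0]
print(exponential_smoothing(incidence, b=0.5))  # -> [2.0, 3.0, 3.0, 5.0, 5.5]
```

A larger `b` follows the raw data more closely, while a smaller `b` suppresses more of the high-frequency noise at the cost of a slower reaction to real changes.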
T. Schelling in 1971 and M. Mitchell in 1993 proposed the theory of cellular automata to model the local characteristics of susceptible populations together with stochastic parameters that reflect the probabilistic nature of the development of a biological epidemic. A cellular automaton is considered as a set of square cells united in a rectangular grid, each cell of which takes a state from a finite set. Grid nodes model entities (individuals), each of which has a fixed position in space. This approach allows us to focus on the contribution of the human factor to the process of the development of a cyber epidemic. The description of the process of computer network node infection in terms of probabilistic cellular automata and ordinary differential equations is promising and will be investigated by the authors in subsequent works.
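To make the idea of a probabilistic cellular automaton concrete, the sketch below updates a rectangular grid of nodes in which every infected neighbour gets one independent chance to infect a susceptible node. The two-state cell, the von Neumann neighbourhood, the absence of recovery, and the parameter name `p_infect` are simplifying assumptions of this example, not a model taken from the cited works.

```python
import random


def step(grid, p_infect, rng):
    """One synchronous update of a probabilistic cellular automaton.

    grid[r][c] is 1 for an infected node and 0 for a susceptible one.
    Each infected von Neumann neighbour gets one independent chance,
    with probability p_infect, to infect a susceptible cell.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                continue  # already infected; this sketch has no recovery
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if inside and grid[rr][cc] == 1 and rng.random() < p_infect:
                    new_grid[r][c] = 1
                    break
    return new_grid


rng = random.Random(0)
grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
for _ in range(3):
    grid = step(grid, p_infect=0.5, rng=rng)
print(sum(map(sum, grid)), "infected nodes after 3 steps")
```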
Patrolla in 2004 proposed an agent-oriented model [37,38], which expands the capabilities of cellular automata in the context of tracking the spread of infection, taking into account mutual contacts between individuals united in a certain social group. Such a model is embodied in the scheme of possible contacts as a dynamic or static graph, the vertices of which correspond to objects with a finite, but sufficiently detailed, set of individual properties inherent to individuals or their classes. This is a potentially promising approach in the context of the subject of this article, but its implementation requires very specific a priori information. This fact does not allow the mathematical apparatus of agent-oriented models to claim universality in the context of the subject of this research.
Thus, there is no ready universal solution for describing the development of a cyber epidemic. This fact opens up great prospects for scientific research.
Considering the merits and limitations of the aforementioned approaches, we shall now outline the essential characteristics or attributes that scientific research should possess.
The object of study is the process of the development of a cyber epidemic at an early stage.
The subject of study encompasses probability theory and mathematical statistics, information theory, the theory of experiment planning, mathematical programming methods, and numerical methods.
For a more complete understanding of the mathematics-rich material in the "Models and methods" section, we recommend that the reader first read the article [40], which reveals the theoretical background of the applied research to which this article is devoted.
The aim of the study is to formalize the process of finding optimal functions of probability distribution densities of stochastic characteristic parameters of the variable small data description model with maximum entropy in the context of the problem of forecasting the development of a cyber epidemic at an early stage.
The objectives of the study are:
• to formalize the concept of calculating the variable entropy estimate for the probability distribution density functions of the characteristic parameters of the stochastic model that describes variable small data represented by interval normalized probabilities;
• to formalize the process of forecasting the development of cyber epidemics in terms of the stochastic-entropy concept of the description of variable small data;
• to justify the adequacy of the proposed mathematical apparatus and demonstrate its functionality with an example.
The main contribution. The article proposes a stochastic concept of describing variable small data on the Shannon entropy basis. The core of the concept is the description of small data by linear differential equations with stochastic characteristic parameters. The practical value of the proposed concept is embodied in the method of forecasting the development of a cyber epidemic at an early stage (in conditions of a lack of empirical information). In the context of the research object, the stochastic characteristic parameters of the model are the generation rate, the death rate, and the independent coefficient of variability of the measurement of the initial parameter of the research object. Analytical expressions for estimating the probability distribution densities of these characteristic parameters are proposed. It is assumed that these stochastic parameters of the model are defined on intervals, which allows for manipulation of the nature and type of the corresponding probability distribution density functions. The task of finding optimal probability distribution density functions of the characteristic parameters of the model with maximum entropy is formulated. The proposed method allows for generating sets of trajectories of values of characteristic parameters with optimal probability distribution density functions.
The highlights of the study are:
• the instances of the class of parameterized stochastic models for the description of variable small data,
• the methods of estimating the probability distribution density functions of their parameters, represented by interval probabilities,
• an approach to generating trajectories of random vectors of initial parameters of the model and their statistical processing by the Monte Carlo method to determine numerical characteristics with maximum entropy,
• a method of forecasting the development of a cyber epidemic in terms of the stochastic-entropy concept of describing variable small data.

Setting of the research
Let us examine an object with input parameters $x(t) = \{x_i(t)\}$, output parameters $y(t) = \{y_i(t)\}$, and parameters $\varepsilon(t) = \{\varepsilon_i(t)\}$ that characterize the variability of measurements of the output parameters, $i = \overline{1, n}$, $t \in T_p = [t_0, T]$, $t_0 < T$. We describe the object with a dynamic model with input parameters $x(t) = \{x_i(t)\}$ and output parameters $f(t) = \{f_i(t)\}$, $i = \overline{1, n}$, $t \in T$. We define the censored observation interval for the object and the model as $T_{tr} = [T_-, t_e) \cup [t_e, t_0)$, where $T_e = [T_-, t_e)$ is the training data collection interval and $T_t = [t_e, t_0)$ is the test data collection interval, $T_- < t_e < t_0$. The parameters of this dynamic model are stochastic values, and the characteristic features of the model are the probability distribution densities of these stochastic parameters. The optimal estimate of the desired probability distribution densities can be obtained from the data collected on the interval $T_e$. We will use the data collected on the interval $T_t$ to test the model. On the interval $T_p$, we will forecast the object-process using the model, $f(t) \to y(t)$. Let us formalize the connection between the parameters $x(t)$ and $f(t)$ in the form of a system of linear differential equations (1). The resulting output of model (1) is described by expression (2). Let us formulate the following requirements:
R1. The matrix $C^{(f)}$ is formed by stochastic elements of the interval type (3), where $C^{(f)}_{\pm}$ are the applied matrices, the elements of which can be both stochastic quantities and linear combinations of a finite number of stochastic quantities;
R2. The elements of the matrix $C^{(x)}$ are known and fixed;
R3. The probability distribution density $P(C)$ exists, $\forall C^{(f)} \in \mathcal{C}$;
R4. The vectors $\varepsilon(t)$, $t \in T_e$, are formed by independent components of the interval type (4).
If the conditions R1-R4 are fulfilled, then model (1) allows obtaining a set of trajectories (2) for the stochastic parameter $f(t)$, $t \in T_e, T_t, T_p$.
Let us rewrite expression (1) taking into account the existence of the fundamental matrix of solutions [39], which gives expression (5). Based on expression (5), we write expression (6). If the measurement of "input-output" entities is carried out at discrete moments with a step $\Delta$, then on the interval $T_e$ expression (6) takes the form (7), where $i \in \overline{1, N_e}$, $N_e = (t_e - T_-)/\Delta$.
Let us rewrite expression (2) taking into account expression (7), which gives expression (8), where $i \in [0, N_e]$. For compactness, we denote the block vector $\varepsilon(T_- + i\Delta)$ of dimension $n \times (N_e + 1)$ mentioned in expression (8) as $\varepsilon^{(e)} = \{\varepsilon^{(k)}\}$, $k = \overline{0, N_e}$. Taking into account the a priori independence of both the vectors $\varepsilon^{(e)}$ and their elements, we define the joint probability distribution density as $Q(\varepsilon^{(e)})$ with the domain of definition $E^{(e)} = E^{N_e + 1}$.
www.nature.com/scientificreports/
Therefore, with a defined matrix $C$, which is characterized by the probability distribution density $P(C)$, and the vector of the variability of the measurements of the output parameters $\varepsilon^{(e)}$, which is characterized by the joint probability distribution density $Q(\varepsilon^{(e)})$, expression (8) is the basis for obtaining a set of the desired stochastic trajectories o(T

Stochastic concept of the description of variable small data in Shannon entropy basis
We formalize the estimation of the optimal probability distribution densities $P^*(C)$ and $Q^*(\varepsilon^{(e)})$ in terms of the stochastic model for variable small data evaluation in the Shannon entropy basis, which the authors presented in [40]. We define the objective function of the optimization problem as

$$H(P, Q) = -\int_{\mathcal{C}} P(C) \ln P(C)\, dC - \int_{E^{(e)}} Q(\varepsilon^{(e)}) \ln Q(\varepsilon^{(e)})\, d\varepsilon^{(e)} \to \max. \qquad (9)$$

We specify the system of limitations of this optimization problem. The first limitation is obvious and concerns the normalization of the investigated probability distribution densities:

$$\int_{\mathcal{C}} P(C)\, dC = 1, \qquad \int_{E^{(e)}} Q(\varepsilon^{(e)})\, d\varepsilon^{(e)} = 1. \qquad (10)$$

The second limitation ensures the adequacy of the model to the studied process and is aimed at maintaining a balance between the output parameter of the object $y(t)$ and the output parameter of the model $o(t)$. We formulate this balance equation (11) for the discrete form of representation of the corresponding characteristic parameters, where $M\{o^{(i)}\}$ is the first moment of the parameter $o^{(i)}$ (see expression (5) in the authors' work [40]) and the parameter $w^{(i)}$ is determined by its defining expression. The conditions $\int_{\mathcal{C}} w^{(i)} P(C)\, dC \le 0.5$ and $\int_{E^{(e)}} \varepsilon^{(i)} Q(\varepsilon^{(e)})\, d\varepsilon^{(e)} \le 0.5$ are ensured by manipulating the values of $w^{(i)}$ and $\varepsilon^{(i)}$, respectively.
The balance Eq. (11) is formulated in the context of the independence of the elements of the vector $\varepsilon^{(e)}$.
The optimization problem with the objective function (9) and limitations (10), (11) can be classified as a global optimization problem [41]. The theory of global optimization covers a large and constantly expanding family of solution methods, which can be most generally divided into three classes. The methods of the first class are focused on the configuration of the objective function and the set of admissible solutions. A characteristic representative of this class is the concept of DC minimization, in which the objective function and the limitation functions are represented by differences of two convex functions. The methods of the second class investigate simple admissible sets and objective functions with a known Lipschitz constant. We especially note the concept of reducing an n-dimensional problem to a 1-dimensional one using Peano curves [41]. The third class of methods is based on the Monte Carlo method with various pseudo-intelligent heuristics [41,42]. In this class of methods, it is necessary to solve the problem of generating uniformly distributed stochastic vectors within the domain of the search space. For this, numerous modifications of the Hit-and-Run concept [41,42], concepts based on Markov chains [43], and concepts based on the Kullback-Leibler entropy [32,41,42] are used. Further analytical constructions will be formulated based on the methods of the third class.
The optimization problem with the objective function (9) and limitations (10), (11) belongs to the Lyapunov type because both the objective functional and the limitations are integral. Let us analytically express the solution to this problem in terms of the Monte Carlo concept for global optimization. The resulting expressions for the optimal densities involve the solution vector $\beta = \{\beta^{(i)}\}$, $i = \overline{0, N_e}$, the scalar product, denoted by the sign "•", and the functions $R(\beta)$ and $Q(\beta)$.
By substituting expression (13) into expression (12), we obtain the system of Eq. (14) for the vectors of Lagrange multipliers. The optimal solution $\beta^* = \{\beta^{*(i)}\}$, $i = \overline{0, N_e}$, of the system of Eq. (14) coincides with the global extremum of the discrepancy function $J(\beta)$, expression (15). Reaching $\beta^*$ marks the completion of training of the model (7) with the stochastic composite parameters $C$, $\varepsilon^{(e)}$ and the corresponding probability distribution density functions $P^*(C)$ and $Q^*(\varepsilon^{(e)})$, which are determined by expressions (13). The parallelepiped-like regions of admissible values of the parameters $C$ and $\varepsilon^{(e)}$ are defined by expressions (3) and (4), respectively.
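The role of a Lagrange multiplier in such a balance condition can be illustrated on a one-dimensional toy case. The sketch below assumes a maximum-entropy density proportional to $\exp(-\beta \varepsilon)$ on an interval and finds the $\beta$ for which the first moment equals a prescribed value. It uses simple bisection on a single scalar balance instead of the norm-minimization over the whole vector $\beta$ described in the text, and the interval, the target value, and the function names are hypothetical.

```python
import math


def mean_under_maxent(beta, lo, hi, n=2000):
    """First moment of the density proportional to exp(-beta * x) on [lo, hi],
    evaluated with the midpoint rule."""
    h = (hi - lo) / n
    num = den = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        w = math.exp(-beta * x)
        num += x * w
        den += w
    return num / den


def solve_beta(target, lo, hi, beta_lo=-50.0, beta_hi=50.0, tol=1e-10):
    """Find beta whose density reproduces the target mean.

    The mean decreases monotonically as beta grows (its derivative is
    minus the variance), so plain bisection is sufficient here.
    """
    for _ in range(200):
        mid = 0.5 * (beta_lo + beta_hi)
        if mean_under_maxent(mid, lo, hi) > target:
            beta_lo = mid   # mean too large: a larger beta is needed
        else:
            beta_hi = mid
        if beta_hi - beta_lo < tol:
            break
    return 0.5 * (beta_lo + beta_hi)


# Hypothetical balance: the mean measurement variability must equal 0.1.
beta_star = solve_beta(target=0.1, lo=-0.5, hi=0.5)
print(beta_star, mean_under_maxent(beta_star, -0.5, 0.5))
```

With a zero target on a symmetric interval the multiplier vanishes and the density degenerates to the uniform one, which is the maximum-entropy density when no balance condition is active.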
We now focus on the application of the trained model of the description of variable small data in the Shannon entropy basis for forecasting. Forecasting based on the trained model (7) consists in generating stochastic matrices of parameters $C$ and $\varepsilon$ with the probability distribution density functions (13) for the interval $T_p$. Let us formalize this process. We move from the matrix to the vector form of the description of the characteristic parameter $C$. To do this, we concatenate the rows of the matrix, obtaining a vector $\alpha$ of length $m = n^2$ of independent stochastic elements. The domain for the elements of the stochastic vector $\alpha$ is defined by the m-dimensional parallelepiped $A = \{\alpha : \alpha_- \le \alpha \le \alpha_+\}$, where the vectors $\alpha_-$ and $\alpha_+$ are the result of the matrix-to-vector transformation of the matrices $C^{(f)}_-$ and $C^{(f)}_+$ described above, respectively. Let us introduce the vectors $q$ that belong to the positive unit cube $Q = \{q : 0 \le q \le 1\}$. We connect the vectors $\alpha$ and $q$ by an analytic relation of the form $\alpha = q(\alpha_+ - \alpha_-) + \alpha_-$, understood component-wise.
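The two transformations described above, flattening a matrix row by row and mapping a point of the unit cube to the parallelepiped of admissible values, can be sketched as follows. The function names and the numeric bounds are hypothetical and serve only to illustrate the relation $\alpha = q(\alpha_+ - \alpha_-) + \alpha_-$.

```python
def flatten_rows(matrix):
    """Concatenate the rows of a matrix into a single vector."""
    return [value for row in matrix for value in row]


def cube_to_box(q, alpha_minus, alpha_plus):
    """Map q from the unit cube to the box [alpha_minus, alpha_plus],
    applying alpha = q * (alpha_plus - alpha_minus) + alpha_minus
    to every component."""
    return [lo + qi * (hi - lo) for qi, lo, hi in zip(q, alpha_minus, alpha_plus)]


# Hypothetical 2x2 bounding matrices, flattened to vectors of length 4.
alpha_minus = flatten_rows([[0.0, 2.0], [-1.0, 0.5]])
alpha_plus = flatten_rows([[1.0, 4.0], [1.0, 0.5]])
print(cube_to_box([0.0, 0.5, 1.0, 0.3], alpha_minus, alpha_plus))  # -> [0.0, 3.0, 1.0, 0.5]
```

Working in the unit cube means that a single sampler can serve any set of interval bounds, since only this affine map depends on them.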
Based on the above, the optimal probability distribution density $P^*(C)$ undergoes a sequence of transformations of the form (16). To generate stochastic vectors $q \in Q$ with the probability distribution density $P(q)$, it is proposed to use the acceptance-rejection method [42]. This choice is justified by the fact that we assume the rational sufficiency of the procedures for measuring the characteristic parameters of the object on the intervals $T_e$, $T_t$.
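A minimal sketch of acceptance-rejection sampling on the unit cube is given below. It assumes a uniform proposal and a known upper bound on the (possibly unnormalized) target density. The exponential target and the value of `beta` are hypothetical stand-ins for the optimal density of the text.

```python
import math
import random


def rejection_sample(density, bound, dim, n, rng):
    """Draw n points from the unit cube [0, 1]^dim with the given
    unnormalized density, using a uniform proposal.

    bound must satisfy density(q) <= bound for every q in the cube.
    """
    samples = []
    while len(samples) < n:
        q = [rng.random() for _ in range(dim)]  # uniform candidate point
        u = rng.random() * bound                # uniform height under the envelope
        if u <= density(q):                     # accept with probability density(q) / bound
            samples.append(q)
    return samples


beta = 2.0  # hypothetical multiplier, not a value from the study


def density(q):
    return math.exp(-beta * q[0])


draws = rejection_sample(density, bound=1.0, dim=1, n=20000, rng=random.Random(42))
mean_q = sum(q[0] for q in draws) / len(draws)
print(round(mean_q, 2))
```

The acceptance rate equals the integral of the density divided by the bound, so a loose bound costs run time but does not affect correctness.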

Forecasting the development of the cyber epidemic in terms of the stochastic-entropy concept of the description of variable small data
Let us take a high-availability cluster [44,45] as the environment for the start and development of a cyber epidemic. Consider the cluster as a closed system. Let us introduce the parameter $E(t)$, which characterizes the number of infected cluster nodes at time $t$. The change in the number of infected nodes is characterized by the variable $v(t) = dE(t)/dt$. The dynamics of the parameter $v(t)$ are determined by the combined influence of the generation and death flows.
The generation flow is characterized by the parameter $B$ (the number of infected cluster nodes per unit of time). Symmetrically, the death flow is characterized by the parameter $M$ (the number of infected cluster nodes per unit of time that passed into a neutral state as a result of the activity of individual defence mechanisms that coped with the cyber infection, hereinafter "disinfected"). We emphasize that we are focusing on the early stage of the spread of a new cyber infection, when a unified mechanism for its neutralization has not yet been created. We assume that both of these flows depend linearly on the total number of nodes in the cluster. Let us move from real time $t$ to a relative time scale, expression (17); this is convenient because information processes in modern highly integrated cyber-physical systems are relatively fast. In the time scale defined by expression (17), the development of the cyber epidemic is determined by a first-order differential equation (18), where $b$ is the relative generation rate (the number of newly infected nodes per time quantum, relative to the total number of nodes) and $m$ is the relative death rate (the number of disinfected nodes per time quantum, relative to the total number of nodes). In current differential models of the development of a cyber epidemic, these coefficients are considered constant on certain time intervals. We argue that it is more realistic to define these parameters as interval ones, taking values in the intervals $I_b$ and $I_m$.
This approach allows taking into account the a priori uncertainty inherent in these characteristic parameters. This uncertainty prompts us to interpret the entities $b$ and $m$ as stochastic parameters that take values in the intervals $I_b$ and $I_m$ with the joint probability distribution density function $P(b, m)$ and the additive interval variability of the measurements $\varepsilon = \{\varepsilon(i\Delta)\}$, $i = \overline{0, I}$, where $I$ is the number of heuristic antivirus scanning procedures in the time quantum $\Delta$. These independent elements, generalized by the stochastic vector $\varepsilon$, are characterized by the probability distribution density $Q(\varepsilon)$, which is defined on the set of admissible interval values of $\varepsilon$. Let us analytically express the solution of Eq. (18) for $\tau \in T$, $T = [\tau_-, \tau_0]$, $\tau_- = T_-/\Delta$, $\tau_0 = t_0/\Delta$, which gives expression (19). By analogy with expression (8), we interpret the change in the number of infected nodes $v(t)$ by taking into account expression (19), which gives model (20), where $i \in [0, I]$, together with the function (21). Note that it is the function (21) that gives the model (20) its individuality. For the intervals $T_e$, $T_t$, $T_p$, represented by the corresponding vectors of measurement results of length $N_e + 1$, $N_t + 1$, $N_p + 1$, model (20) takes the forms (22)-(24),
where the generation rate $b$ and the death rate $m$ are stochastic parameters with the optimal joint probability distribution density $P^*(b, m)$, determined on the set $I_b \cup I_m$ by expression (13) at $i \in [0, N_e]$ and by expression (16) at $i \in [0, N_t], [0, N_p]$; the disturbance $\varepsilon(i\Delta)$ is a vector whose elements are stochastically independent quantities with the probability distribution density $Q(\varepsilon)$, $i = \overline{0, I}$; the parameters $E_t(0)$, $E_p(0)$ are constant coefficients that are assigned by experts.
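For fixed values of the two rates, the verbal description of the generation and death flows suggests a linear first-order balance of the form $dE/d\tau = (b - m)E$, whose solution is exponential. This reading is an assumption of the sketch below, which compares the closed-form solution with a direct numerical integration. The rate values are hypothetical.

```python
import math


def closed_form(e0, b, m, tau):
    """Solution of dE/dtau = (b - m) * E with E(0) = e0."""
    return e0 * math.exp((b - m) * tau)


def euler(e0, b, m, tau, steps):
    """Explicit Euler integration of the same equation, as a cross-check."""
    h = tau / steps
    e = e0
    for _ in range(steps):
        e += h * (b - m) * e
    return e


# Hypothetical rates: generation exceeds death, so the infection grows.
exact = closed_form(1.0, b=0.3, m=0.1, tau=5.0)
approx = euler(1.0, b=0.3, m=0.1, tau=5.0, steps=100_000)
print(exact, approx)
```

When $b = m$ the two flows cancel and the number of infected nodes stays at its initial value, while $b < m$ gives exponential decay.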
Let us analyze the functions $P^*(b, m)$, $Q^*(\varepsilon)$ analytically, based on the material of the previous section.
The optimal joint probability distribution density function for the generation coefficient $b$ and the death coefficient $m$ is expressed by (25), in which $p^*_j(b, m \mid \beta_j) = \exp\left(-\beta_j \varphi_j(b, m \mid E_e(0))\right)$ and $\varphi_j$ is the function (21). The optimal probability density function for the variability of measurements of the output characteristic parameters of the object $\varepsilon$ is expressed by (26), in which $q^*_j(\varepsilon(j\Delta) \mid \beta_j) = \exp\left(-\beta_j \varepsilon(j\Delta)\right)$. To determine the Lagrange multipliers, we express the balance Eq. (14) in terms of expressions (25), (26) for $i \in [0, N_e]$, obtaining expression (27) with the auxiliary notation (28). Expanding the second term of expression (27) gives Eqs. (29). We obtain the roots of Eqs. (29) $\forall i \in [0, N_e]$ by analogy with expression (15), as a result of minimizing the discrepancy $J(\beta)$, expression (30), where $\| \cdot \|$ is interpreted as the Euclidean norm.
The dimensionality of the optimization problem with the objective function (30) is equal to $N_e + 1$. The actual complexity of the function (30) makes further analytical research of its properties impossible.

Results
Experimental studies with the mathematical apparatus proposed in "Models and methods" will begin with the analysis of the focus group of "consumers". In the architecture of modern cyber protection tools, SIEM is undoubtedly such a "consumer" [41-43]. A classic SIEM is a log collector that gathers events from such sources as DLP systems, firewalls, IPS, servers, workstations, routers, etc. and analyzes them to detect information security incidents. The main scenarios of using SIEM include the detection of attacks in the early stages, automatic mapping of IT infrastructure, real-time monitoring of the state of IT infrastructure, detection and investigation of information security incidents, detection of new types of threats, optimization of the security monitoring model, etc. At the same time, it is important to understand that SIEM is not a means of protection as such. It is an integrator, driven by embedded logic and statistics, for tools and functions that are sometimes unrelated, the purpose of which is to automate end-to-end information security processes. It is rational to implement SIEM if:
• at least 1,000 computing devices are involved in joint organizational activities;
• basic means of information protection are implemented and functioning, for example, an antivirus system, UTM and/or IDS/IPS, DLP, Web proxy, etc.;
• there is a need to reduce the intervention of the "human factor" in the processes of the information security service;
• there is a need to ensure efficiency, reasonableness and integrity of the decision-making process in the field of information security;
• it is necessary to ensure compliance of the protected cyberinfrastructure with the ISO 27001, NIST SP 800-61 and NIST SP 800-83 standards.
Therefore, the expediency of applying the theoretical results of this research in SIEM is obvious. However, such an applied orientation becomes problematic when it is necessary to find an open dataset for testing the proposed system. These circumstances force us to resort to simulation modelling, the data for which is a generalization of open information about the spread of the Petya encryption virus. On June 27, 2017, the victims of this virus were Ukrainian companies and Ukrainian branches of international companies, including Nova Poshta, Zaporizhzhiaoblenergo, Dniproenergo, Oshchadbank, media holding Lux, Mondelēz International, TESA, Nivea, Mars, mobile operators LifeCell, UkrTeleCom, Kyivstar and many others. In Kyiv, in particular, some ATMs and cash terminals in stores were found to be infected. It was in Ukraine that the first attacks were recorded. The authors summarized the information available on the IT community website https://habr.com regarding the spread of the Petya virus in the form of a dataset visualized in Fig. 1. The ordinate axis is graduated in thousand c.u. and represents the average number of infected computing devices $E$. The abscissa axis is graduated in time quanta $i$, with $T_e = [0, 6]$, $T_t = [7, 11]$, $T_{tr} = T_e \cup T_t$, $T_p \in [12, 18]$. Since variability is characteristic of the $E = f(i)$ measurements, we further take it into account by defining a stochastic vector $\varepsilon$ with independent stochastic interval elements $\varepsilon(i\Delta) \in [\varepsilon_-, \varepsilon_+]$, $i = \overline{0, 18}$.
We will carry out further calculations by applying three sets of intervals, $I_1$, $I_2$, $I_3$, for the generation coefficient $b$ and the death coefficient $m$. The interval for the limits of variation of the model's output parameter is set as $\mathcal{E} = \mathcal{E}_j = [-0.5; 0.5]$ $\forall j \in [0, N_e]$, $N_e = 6$. Let us apply the mathematical apparatus presented in "Models and methods" to the analysis of the output data on the training, test and forecast intervals $T_e$, $T_t$, $T_p$, respectively.
The training interval covers the data $T_e \in i = [0, 6]$. The residual function (30) contains two integral components that can only be evaluated numerically. For this, a combination of several quadrature formulas, generalized by the Tiled method and implemented in the Matlab engineering software package as the quad2d function, was used. The essence of this method is to divide the area of integration into a set of trapezoidal or rectangular areas. This choice is justified by the fact that the Trust Region method, represented in Matlab by the lsqnonlin function, was then used to minimize the discrepancy $J(\beta)$. The lsqnonlin function is optimized for use with a function of the quadratic norm type. Using the lsqnonlin function with the discrepancy level $J(\beta) = 10^{-3}$ made it possible to calculate the values of the Lagrange multipliers $B = \{\beta\}$. The results of the calculations are presented in Fig. 2.
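The idea of evaluating a two-dimensional integral by splitting the integration area into rectangular tiles can be sketched with a plain midpoint rule. This is a simplified stand-in written for illustration and not a reimplementation of Matlab's quad2d, which is adaptive. The integrand and the grid sizes are hypothetical.

```python
import math


def midpoint_2d(f, x_range, y_range, nx=200, ny=200):
    """Approximate the integral of f(x, y) over a rectangle by summing
    the values at tile centres times the tile area."""
    (x0, x1), (y0, y1) = x_range, y_range
    hx, hy = (x1 - x0) / nx, (y1 - y0) / ny
    total = 0.0
    for i in range(nx):
        x = x0 + (i + 0.5) * hx
        for j in range(ny):
            y = y0 + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy


# Hypothetical check: the exact value of this integral is (e - 1)**2.
value = midpoint_2d(lambda b, m: math.exp(b + m), (0.0, 1.0), (0.0, 1.0))
print(value, (math.e - 1.0) ** 2)
```

An adaptive scheme refines only the tiles where the local error estimate is large, which matters for sharply peaked densities, whereas the fixed grid here is adequate only for smooth integrands.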
The known values of the Lagrange multipliers $B$ make it possible to work backwards and calculate the values of the functions $P^* = f(I_i, b, m)$, $i = \overline{1, 3}$, and $Q^* = f(\varepsilon, j)$, $j = \overline{0, N_e}$, $N_e = 6$, using expressions (25) and (26), respectively. The calculated dependencies are presented in Figs. 3 and 4. Note that, for ease of perception, the three-dimensional dependence $P^* = f(I_i, b, m)$ is presented in 2D projections for the limit values of the characteristic parameters $b$, $m$. The boundary values are the limits of the intervals for these variables, summarized by the sets $I_i$, $i = \overline{1, 3}$.
After training the model (18), we proceed to its testing. The test interval covers the data $T_t \in i = [7, 11]$ (see Fig. 1). The output parameter $E^{(t)}$ is calculated according to expression (23), where $b$ and $m$ are stochastic parameters with the joint probability distribution density function $P^*(b, m)$ and $\varepsilon(i\Delta)$ is the stochastic coefficient of variability of the measurements of the output parameter of the object with the probability distribution density functions $q_i(\varepsilon(i\Delta)) \in Q^*$, $i \in [7, 11]$. To generate trajectories of the stochastic parameters $b$, $m$, $\varepsilon(i\Delta)$, $i \in [7, 11]$, a 2D adaptation of the Ulam-Neumann rejection method [42] with the volume of the generated sample $k = 10^5$ was used. Each exponential trajectory is determined by a pair of values of the stochastic parameters $b$, $m$, and a value of the stochastic parameter $\varepsilon(i\Delta)$, drawn according to its probability distribution density, is added to each $i$-th point of this trajectory. The resulting trajectory can no longer be classified as exponential. The only deterministic parameter that affects the set of trajectories is the number of infected computing devices at the initial moment $i = 0$.
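The trajectory generation described above can be sketched as follows: a pair of rates fixes an exponential curve, and an independent disturbance is then added to every point. For brevity the sketch draws the rates and the disturbances from uniform densities on their intervals, which is a placeholder assumption standing in for the optimal densities $P^*$ and $Q^*$. The interval bounds, the initial value, and the ensemble size are hypothetical.

```python
import math
import random


def sample_trajectory(e0, b_interval, m_interval, eps_interval, n_points, rng):
    """One noisy trajectory: an exponential curve fixed by (b, m)
    plus an independent additive disturbance at every time index."""
    b = rng.uniform(*b_interval)   # placeholder for a draw from P*(b, m)
    m = rng.uniform(*m_interval)
    return [
        e0 * math.exp((b - m) * i) + rng.uniform(*eps_interval)  # placeholder for Q*
        for i in range(n_points)
    ]


rng = random.Random(7)
ensemble = [
    sample_trajectory(1.0, (0.10, 0.30), (0.05, 0.15), (-0.5, 0.5), 5, rng)
    for _ in range(1000)
]
print(len(ensemble), len(ensemble[0]))  # -> 1000 5
```

Replacing the two `rng.uniform` calls with draws from a rejection sampler for the trained densities turns this placeholder into the procedure described in the text.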
The forecasting results are presented in Fig. 6 by the family of curves $E = f(i)$, $i \in [12, 21] = T_p$. The curve $E^{(p)} = f(i)$ is the averaged trajectory obtained by describing the interval $T_p$ with the model (18), trained on the interval $T_e$, at the limit values of the stochastic parameters $b$ and $m$ imposed by the set $I_2$. The curve $E_{etalon} = f(i)$ is a visualization of the values of the function $E(i)$, $i = \{12 \div 15, 17, 19, 21\}$, from Fig. 1. The curve $E_{Matlab} = f(i)$ demonstrates the result of describing the dependence $E(i)$, $i \in [12, 21]$, with standard Matlab functions in the manner described on the page https://uk.mathworks.com/help/ident/ug/forecasting-predator-prey-populations.html for the initial data $E(i)$, $i = \overline{0, 10}$, from Fig. 1. The curves $\{E_{CI+}, E_{CI-}\} = f(i)$ represent the limits of the confidence interval of the variance of the values $E^{(p)} = f(i)$, $i \in T_p$, obtained using the trained model (18).
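One simple way to obtain an averaged trajectory and a band around it from an ensemble of generated trajectories is to compute, for every time index, the mean and the mean plus or minus a multiple of the standard deviation. The sketch below follows this reading. The exact construction of the confidence limits used for Fig. 6 is not specified in the text, so the multiplier `z` and the use of the population standard deviation are assumptions.

```python
import statistics


def mean_and_band(ensemble, z=1.96):
    """Pointwise mean and mean +/- z * std over an ensemble of trajectories."""
    columns = list(zip(*ensemble))  # regroup the values by time index
    mean = [statistics.fmean(col) for col in columns]
    std = [statistics.pstdev(col) for col in columns]  # population standard deviation
    lower = [mu - z * s for mu, s in zip(mean, std)]
    upper = [mu + z * s for mu, s in zip(mean, std)]
    return mean, lower, upper


# Two hypothetical trajectories with two time points each.
mean, lower, upper = mean_and_band([[1.0, 2.0], [3.0, 4.0]], z=1.0)
print(mean, lower, upper)  # -> [2.0, 3.0] [1.0, 2.0] [3.0, 4.0]
```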

Discussion
The last decade can without exaggeration be called the "decade of neural networks". Bold experiments with architectures of deep neural networks and their ensembles, in combination with the use of Big Data for training, have made it possible to achieve truly impressive results in solving such classical problems of pattern recognition theory as classification and identification. But have neural networks become smarter? Let us recall the classic flaw of neural networks, overfitting. The essence of this problem is that the neural network model, perceiving only instances from the training sample, adapts to them instead of learning to classify them. Simply put, overfitting is when a neural network in the training process "remembers" the training sample instead of "generalizing" it. In principle, with an infinitely large training sample, the problem of overfitting disappears. But when we talk about so-called "small data", this postulate does not work. It is when analyzing small data that the problem of overfitting manifests itself in full. When analyzing small data for classification and identification, one should resort to the methods of machine learning, and not artificial intelligence. This is exactly what the authors did in the context of the task of forecasting the development of a cyber epidemic at an early stage.
Let's take a closer look at the training data, represented in the form of a diagram in Fig. 1. The authors did not choose data visualization over a tabular presentation by chance. Figure 1 demonstrates the dynamics of the cyber epidemic of the spread of the Petya encryption virus as it was presented to the general public. We see, in fact, linear dynamics in the development of this process. Frankly, this immediately raised suspicions among the authors, because intuitively such a process should develop exponentially until the "cavalry from over the hill" appears in the form of a specialized defence mechanism, which marks the break of the exponential. But if we start from the data as given, then we see linear dynamics. This is exactly what the standard methods of forecasting numerical series implemented in Matlab "saw" (see curves E_Matlab = f(i) in Figs. 5 and 6). And although the volume of the test sample was too small for them, which was reflected in the inaccurate determination of the angle of inclination of the line E_Matlab = f(i) relative to the line E_etalon = f(i) in Fig. 5, in Fig. 6 these lines practically coincide. Now let's look at the functions E^(t) = f(i) and E^(p) = f(i) presented in Figs. 5 and 6, respectively. The function E^(t) = f(i) is also linear, which reflects the analytical flexibility embedded in the mathematical model (18). At the same time, the values of the function E^(t) = f(i) stably exceed the values of the function E_etalon = f(i), i = 7, ..., 11. That is, the model (18) trained on the data of interval T_e "prepares for the worst". Finally, the difference appears in Fig. 6. The function E^(p) = f(i) shows an increasing nonlinear character. How can such results be explained? There are two explanations. Either the trained model (18) is inadequate for forecasting the data represented in Fig. 1, or these initial data are incomplete or intentionally distorted.
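Whether a short series looks linear or exponential can be checked numerically. The sketch below is a minimal illustration, assuming a synthetic noiseless series (not the Petya data from Fig. 1) and ordinary least squares for both candidate shapes. The helper names `fit_linear`, `fit_exponential`, and `rmse` are hypothetical and are not part of the authors' method or of the Matlab workflow referenced in the text.

```python
import numpy as np

def fit_linear(i, e):
    """Least-squares fit of E(i) = a * i + c; returns the fitted values."""
    a, c = np.polyfit(i, e, deg=1)
    return a * i + c

def fit_exponential(i, e):
    """Fit E(i) = A * exp(k * i) by linear regression on log E(i).

    Note: this minimizes the error in log space, not in the original units.
    """
    k, log_a = np.polyfit(i, np.log(e), deg=1)
    return np.exp(log_a) * np.exp(k * i)

def rmse(e, e_hat):
    """Root-mean-square error between observations and fitted values."""
    return float(np.sqrt(np.mean((e - e_hat) ** 2)))

# Synthetic series for illustration only: a noiseless linear trend.
i = np.arange(11, dtype=float)
e = 50.0 + 20.0 * i

err_lin = rmse(e, fit_linear(i, e))
err_exp = rmse(e, fit_exponential(i, e))
print(f"linear RMSE = {err_lin:.4f}, exponential RMSE = {err_exp:.4f}")
print("preferred shape:", "linear" if err_lin <= err_exp else "exponential")
```

On a series this short, a purely data-driven fit has no way to anticipate a later exponential phase, which is the limitation the discussion attributes to the standard forecasting functions.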
The authors can reasonably reject the first option. To do this, we recall that the stochastic characteristic parameters b, m, ε(i) take values from intervals, the limit ranges of which are embodied in the sets I_i, i = 1, 2, 3. Recall that the curves E^(t) = f(i) and E^(p) = f(i) were obtained under the condition that the values of the parameters b, m, ε(i) satisfy the set I_2. Now recall that b and m are stochastic parameters with the joint probability density function P*(b, m), and ε(i) is a stochastic coefficient of variability of measurements of the output parameter of the object with probability density functions q_i(ε(i)) ∈ Q*. Let's pay attention to the dependence P* = f(I_2, b, m) shown in Fig. 3. Non-linearity is characteristic of this dependence. This is the source of the nonlinearity of the function E^(p) = f(i) shown in Fig. 6. The authors did not define the set I_3 by accident. Let's pay attention to its characteristics in the form of the dependencies P* = f(I_3, b, m) from Fig. 3, both of which have a linear character. The authors trained the model (18) so that its characteristic parameters satisfied the conditions of the set I_3. In the quality metric ⟨δ, ξ⟩, the obtained result is characterized by the values ⟨δ^(t), ξ^(t)⟩_I3 = (0.6676; 0.0227), i.e. it outperforms the results obtained using standard Matlab methods (recall: ⟨δ_Matlab, ξ_Matlab⟩ = (0.9520; 0.0321)). Thus, the functionality of model (18) for solving the problem of forecasting variable small data, using the example of forecasting the development of a cyber epidemic of the spread of the Petya encryption virus, can be considered proven. The publicly available data on the development of this cyber epidemic was incomplete, and the trained model (18) responded differently from the overfitted standard model from the Matlab environment.
It remains to clarify a few more points regarding the material presented in "Results". The first point is the definition of the set I_1. If you look at its characteristics in the form of the dependencies P* = f(I_1, b, m) from Fig. 3, it becomes obvious that this set is a compromise between the "nonlinear" set I_2 and the "linear" set I_3. The authors recommend using the set I_1 if the initial data is difficult to pre-characterize. The second point is the influence of the stochastic coefficient of variability of measurements ε(i), with probability density functions q_i(ε(i)) ∈ Q*, i ∈ T, on forecasting results. It is impossible to answer this question unambiguously in numerical and parametric form based on the conducted research. This point needs additional investigation in the context of implementing proactive technologies of AI-powered protection of assets against cyberattacks [46-48]. However, these aspects do not affect the functionality and adequacy of the material presented in the article.
The essence of the authors' method is the idea of estimating the probability distributions of the model parameters from a small amount of real empirical data, in the representation of which the probability distributions of the measurement noise are taken into account. The method returns distributions with maximum entropy, which characterize the state of the greatest uncertainty of the studied process. This makes it possible to interpret the resulting forecasts as the most "negative" ones. This circumstance suggests that the authors' method may be appropriate for determining pessimistic scenarios when analyzing the reliability of critical systems in conditions of incomplete or distorted telemetry data. This direction can be developed taking into account the fact that the authors previously proposed a mathematical apparatus for describing the influence of complex negative factors on an information system for critical use based on the theory of Markov processes [49-51].
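The maximum-entropy principle mentioned above can be illustrated on a single parameter confined to an interval. The sketch below is a simplified stand-in, not the authors' estimation procedure. It assumes that the only constraint is a prescribed mean, whereas in the article the densities are constrained through the model and the empirical data. Under that assumption the maximum-entropy density has the form p(x) ∝ exp(λx), and λ can be found by bisection. The function names, the grid size, and the interval [0, 0.5] are hypothetical.

```python
import numpy as np

def shannon_entropy(p, dx):
    """Differential entropy -∫ p ln p dx approximated on a uniform grid."""
    safe = np.where(p > 0, p, 1.0)  # ln(1) = 0 keeps zero-density cells out
    return float(-np.sum(p * np.log(safe)) * dx)

def maxent_density(lo, hi, target_mean, n=2001, tol=1e-10):
    """Maximum-entropy density on [lo, hi] with a prescribed mean.

    The solution has the form p(x) ∝ exp(lam * x); lam is found by bisection
    because the mean increases monotonically with lam.
    """
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]

    def density(lam):
        w = np.exp(lam * (x - x.mean()))  # centering avoids overflow
        return w / (np.sum(w) * dx)

    def mean_of(lam):
        return float(np.sum(x * density(lam)) * dx)

    a, b = -200.0 / (hi - lo), 200.0 / (hi - lo)
    while b - a > tol:
        mid = 0.5 * (a + b)
        if mean_of(mid) < target_mean:
            a = mid
        else:
            b = mid
    return x, density(0.5 * (a + b))

# Hypothetical interval and mean values, for illustration only.
x, p_flat = maxent_density(0.0, 0.5, target_mean=0.25)
_, p_skew = maxent_density(0.0, 0.5, target_mean=0.15)
dx = x[1] - x[0]
print("entropy, mean at midpoint:", shannon_entropy(p_flat, dx))
print("entropy, mean shifted:    ", shannon_entropy(p_skew, dx))
```

When the prescribed mean sits at the midpoint of the interval, the result is the uniform density, the state of greatest uncertainty. Any additional information that shifts the mean lowers the entropy.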

Conclusions
Security Information and Event Management (SIEM) technologies play an important role in the architecture of modern cyber protection tools. One of the main scenarios for the use of SIEM is the detection of attacks on protected information infrastructure. Considering that the ISO 27001, NIST SP 800-61, and NIST SP 800-83 standards objectively do not keep up with the evolution of cyber threats, research aimed at forecasting the development of cyber epidemics is relevant.
The article proposes a stochastic concept of describing variable small data on the Shannon entropy basis. The core of the concept is the description of small data by linear differential equations with stochastic characteristic parameters. The practical value of the proposed concept is embodied in the method of forecasting the development of a cyber epidemic at an early stage (in conditions of a lack of empirical information). In the context of the research object, the stochastic characteristic parameters of the model are the generation rate, the death rate, and the independent coefficient of variability of the measurement of the initial parameter of the research object. Analytical expressions for estimating the probability distribution densities of these characteristic parameters are proposed. It is assumed that these stochastic parameters of the model are imposed on the intervals, which allows for manipulation of the nature and type of the corresponding functions of the probability distribution densities. The task of finding optimal functions of the probability distribution densities of the characteristic parameters of the model with maximum entropy is formulated. The proposed method allows for generating sets of trajectories of values of characteristic parameters with optimal functions of the probability distribution densities. The example demonstrates both the flexibility and reliability of the proposed concept and method in comparison with the concepts of forecasting numerical series implemented in the base of Matlab functions.
The authors see the direction of further research in deepening the understanding of the influence of the variability of measurements of the output parameter of the research object on the results of evaluation and forecasting of small data. This direction could be complemented by enhancing protection means against AI-powered attacks [52, 53].

https://doi.org/10.1038/s41598-023-49007-2

Figure 6. Visualization of a family of curves.